Starbucks Capstone Challenge

A capstone project as part of my Udacity Data Scientist Nanodegree Program

Project Overview

The objective of this project is to analyze in detail the simulated data provided by Udacity (a simplified version of the real Starbucks app data) that mimics customer behavior on the Starbucks rewards mobile app, and to build a machine learning model that predicts whether a customer will respond to an offer.

Once every few days, Starbucks sends out an offer to users of its mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Also, not all users receive the same offer, and that is the challenge to solve with this data set.

The goal here is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. In the data set, informational offers also have a validity period, even though these ads merely provide information about a product; for example, if an informational offer has 7 days of validity, it can be assumed that the customer feels the influence of the offer for 7 days after receiving the advertisement.

Transactional data is also provided. It contains user purchases made on the app, including the timestamp of each purchase and the amount of money spent. This transactional data also has a record for each offer that a user receives, a record for when a user actually views the offer, and a record for when a user completes an offer.

A blog post for this project, explaining the observations from the analysis and the data modelling, is published on Medium.

Example

To give an example, a user could receive a discount offer "buy 10 dollars get 2 off" on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer but never open it during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.

Problem Statement

The problem we are looking to solve in this project is to combine transaction, demographic and offer data, analyze how Starbucks mobile app users respond to different offers, and predict whether a user will respond to an offer using the demographics and offer reward data. Below are the questions we need to answer as part of this project:

  1. What percentage of the offers sent to users were viewed?
  2. How many offers from each offer type were sent to Customers?
  3. What percentage of Customers completed an offer after viewing it and what percentage of Customers completed an offer without viewing it?
  4. How many transactions were completed/influenced because of the sent offers?
  5. Find out the factors that are influenced by the offers:

    a. how gender affects the amount spent in a transaction and which type of discount attracts which gender types.

    b. find out the spread of data for amount spent by the users who completed the offer after viewing it and before it expires, and for users who didn't complete the offer.

    c. analyze the correlation between Age Groups and Amount spent by customers for completed offers transactions with and without offers.

    d. find the correlation between Customer Income and Amount Spent by customers for the transactions with and without completed offers.

    e. how age plays a role in responding to offer rewards.

  6. How did customers respond to offers across the different advertisement channels, for both completed and incomplete offers?

  7. Predict whether a Customer will respond to an offer or not using RandomForestClassifier and LinearSVC.

Metrics

The F1 score is chosen as the metric because it is the harmonic mean of precision and recall and therefore takes both into account.
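As a minimal sketch of how this metric is computed, the snippet below derives precision, recall and F1 by hand from a small set of labels and compares the result with scikit-learn's `f1_score`. The label vectors are made up for illustration and are not taken from the project data.

```python
from sklearn.metrics import f1_score

# Toy labels (assumed for illustration): 1 = responded to the offer, 0 = did not
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count true positives, false positives and false negatives for class 1
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

# The library value should agree with the manual calculation
f1_sklearn = f1_score(y_true, y_pred)
print(f1_manual, f1_sklearn)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing only precision or only recall.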

Data Understanding

Import required libraries

Read the Data Sets

The data is contained in three files: portfolio.json, profile.json and transcript.json.

Let's understand each of these data sets in detail.
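The following is a hedged sketch of how such files could be loaded with pandas. It assumes the files are line-delimited JSON records; the helper name `load_json_lines` and the two sample records are hypothetical stand-ins, not the project's actual code or data.

```python
import io

import pandas as pd


def load_json_lines(source):
    """Read a line-delimited JSON source (path or file-like object) into a DataFrame."""
    return pd.read_json(source, orient="records", lines=True)


# In-memory stand-in for a file such as portfolio.json; the records are made up
sample = io.StringIO(
    '{"id": "a1", "offer_type": "bogo", "duration": 5}\n'
    '{"id": "b2", "offer_type": "discount", "duration": 10}\n'
)
portfolio_sample = load_json_lines(sample)
print(portfolio_sample)
```

With real files, the same helper would be called with a path, for example `load_json_lines("portfolio.json")`, assuming the file sits in the working directory.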

Understanding offer data from portfolio data set

portfolio.json

The dataset contains 3 different offer types: BOGO, discount and informational (advertisement).

Understanding Demographic data from profile data set

Demographic data for customers is provided in the profile dataset. The schema and variables are as follows:

profile.json

Understanding transcript data set

The schema for the transactional data is as follows:

transcript.json

There are 4 unique events in the data set: transaction, offer received, offer viewed and offer completed.

The number of people in the transcript data set is the same as the number of people in the demographic data, which makes it easy to combine the datasets.

Data Preparation

Now that we understand the data sets, let's proceed to the next stage, i.e. data preparation/data wrangling of all data sets.

Data Wrangling of Portfolio data set

Handling of categorical variables
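One common way to handle categorical variables is one-hot encoding. The sketch below assumes the portfolio data has an `offer_type` column and a `channels` column holding a list of channels per offer; these column names, the prefixes and the sample rows are illustrative assumptions rather than the project's actual schema.

```python
import pandas as pd

# Made-up portfolio rows; column names are assumptions for illustration
portfolio = pd.DataFrame({
    "offer_id": ["a1", "b2", "c3"],
    "offer_type": ["bogo", "discount", "informational"],
    "channels": [["email", "mobile"], ["web", "email"], ["email", "social"]],
})

# One indicator column per offer type
type_dummies = pd.get_dummies(portfolio["offer_type"], prefix="type", dtype=int)

# One indicator column per channel: explode the lists to one channel per row,
# encode, then collapse back to one row per offer
channel_dummies = (
    portfolio["channels"].explode().str.get_dummies().groupby(level=0).max()
)
channel_dummies = channel_dummies.add_prefix("channel_")

encoded = pd.concat(
    [portfolio.drop(columns=["offer_type", "channels"]), type_dummies, channel_dummies],
    axis=1,
)
print(encoded)
```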

Data Wrangling of profile data set

Data Wrangling of transcript data set

Data Exploration

Let's perform some data exploration using various visualization techniques to gain insights about the data sets.

Find out the distribution of our users based on their age and gender.

There is a big spike around age 120. The chart shows that the data set contains a disproportionately large number of customer entries with an age of around 120 years.

This is because some users selected their gender as O (which is other than male or female) and they all have an exact age of 118.

The income values for those users are also null in the dataset, as shown below.

We conclude that there are 2175 users with an age of 118 years who preferred not to share personal info such as gender and income. So let's move those users to a new data frame and add a new column that flags whether a user provided any personal info: 0 if they did, 1 otherwise.
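A minimal sketch of this flagging step is shown below. The column names (`age`, `income`, `no_user_info`) and the sample rows are assumptions; the rule used here, an age of 118 together with a missing income, follows the observation above.

```python
import numpy as np
import pandas as pd

# Made-up profile rows; column names are assumptions for illustration
profile = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "age": [45, 118, 62, 118],
    "income": [52000.0, np.nan, 71000.0, np.nan],
})

# 1 = no personal info provided (placeholder age and missing income), 0 = info provided
missing_info = (profile["age"] == 118) & profile["income"].isna()
profile["no_user_info"] = missing_info.astype(int)

# Separate data frames for users with and without personal info
profile_missing = profile[profile["no_user_info"] == 1].copy()
profile_known = profile[profile["no_user_info"] == 0].copy()
print(profile["no_user_info"].value_counts())
```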

Customer Distribution based on user info provided

Let's repeat the exploration of the age distribution by excluding the users who didn't provide any user info.

Distribution of age for the customers for which user info is available

The above visualization covers only customers for whom user info is available. The majority of customers in the dataset are in their late 50s or early 60s, and the number of customers decreases as we move away from the peak, in particular after age 65.

Next, let's look at the gender distribution after filtering out the customers who didn't provide any personal info.

After filtering out the customers who have not provided any personal info, we observe 8484 male customers, which is 57.23% of all customers who provided personal info. The share of female customers is just under 16 percentage points lower than that of males, and the other category accounts for 1.43%.

Now let's look at the distribution of income for customers who provided personal info.

After filtering out the customers who have not provided any personal info, we observe that many users have an annual income between 30000 USD and 50000 USD, and the majority of customers have an income between 50000 USD and 75000 USD. The number of customers decreases as income increases, meaning there are fewer users in the high income ranges.

Now let's start exploring the transcript dataset. Find out the distribution of offer events.

The events in the transcript dataset clearly fall into two kinds: offer events and transactions.

Almost 55% of the records in the transcript dataset are offer-related events: 24.92% are offer received, 18.86% are offer viewed and 10.84% are offer completed. The remaining 45% of records are transactions.
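Percentages like these can be obtained with a normalized value count. The sketch below uses a handful of made-up events and assumes the event type is stored in a column named `event`; it does not reproduce the project's figures.

```python
import pandas as pd

# Made-up events; the 'event' column name is an assumption
transcript = pd.DataFrame({
    "event": [
        "offer received", "offer viewed", "transaction", "transaction",
        "offer received", "offer completed", "transaction", "offer viewed",
        "offer received", "transaction",
    ],
})

# Percentage share of each event type
event_share = transcript["event"].value_counts(normalize=True).mul(100).round(2)

# Share of offer-related events versus plain transactions
is_offer = transcript["event"].str.startswith("offer")
offer_share = round(is_offer.mean() * 100, 2)

print(event_share)
print(offer_share)
```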

Hence it can be deduced that not all customers who received the offer viewed it and not all customers who viewed the offer completed it.

Now let's answer the questions highlighted in the problem statement.

Question-1. What percentage of the offers sent to users were viewed?

Out of 76277 offers sent, 57725 were viewed by users, which is 75.68% of all offers sent.

Question-2. How many offers from each offer type were sent to Customers?

To answer this question, let's continue our data exploration and analysis, beginning with renaming columns and merging the data sets.

The reward_given and offer_reward columns are identical for all records with a completed offer (offer_completed==1), so let's drop the reward_given column.

Now let's look at the distribution of the offers sent to Customers

From the visualization, we can conclude that almost the same number of BOGO (Buy 1 Get 1) and Discount offers were sent to users, which is almost double the number of Advertisement offers sent.

Now let's select the records where transactions occurred after receiving an offer and while that offer was still valid.

To achieve this, let's divide the dataframe into 3 groups based on offer received, offer viewed and offer completed. Then we merge the data frames and add an offer_expiry column that holds the deadline for each offer.

Now we merge those dataframes on user_id and offer_id columns, to be able to compare the time between transactions.
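The split, merge and deadline steps could look roughly like the sketch below. It assumes event times are recorded in hours and offer durations in days, and all column names other than `user_id`, `offer_id` and `offer_expiry` are hypothetical; the helper `events_of` and the sample rows are illustrative only.

```python
import pandas as pd

# Made-up offer events; hour-based 'time' and day-based 'duration' are assumptions
offers = pd.DataFrame({
    "user_id":  ["u1", "u1", "u1", "u2", "u2"],
    "offer_id": ["a1", "a1", "a1", "b2", "b2"],
    "event":    ["offer received", "offer viewed", "offer completed",
                 "offer received", "offer completed"],
    "time":     [0, 12, 48, 24, 300],
    "duration": [5, 5, 5, 10, 10],
})


def events_of(kind, time_name):
    """Return the rows for one event type, with the time column renamed."""
    subset = offers.loc[offers["event"] == kind, ["user_id", "offer_id", "time"]]
    return subset.rename(columns={"time": time_name})


# Attach the validity period of each offer to the 'received' records
received = events_of("offer received", "time_received").merge(
    offers[["offer_id", "duration"]].drop_duplicates(), on="offer_id"
)
viewed = events_of("offer viewed", "time_viewed")
completed = events_of("offer completed", "time_completed")

# Left joins keep offers that were never viewed or never completed
merged = (
    received
    .merge(viewed, on=["user_id", "offer_id"], how="left")
    .merge(completed, on=["user_id", "offer_id"], how="left")
)

# Deadline in hours: receipt time plus the validity period converted from days
merged["offer_expiry"] = merged["time_received"] + merged["duration"] * 24
print(merged)
```

Note that a user who receives the same offer more than once would produce several rows per key in a merge like this, which is why duplicates need to be handled afterwards.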

Let's ensure no user completed an offer twice in the same hour. This allows us to remove duplicate values in the merged dataframe.

Let's ensure no user received an offer twice in the same hour. This allows us to remove duplicate values in the merged dataframe.

Now that we removed the duplicate entries, let's select users who completed the offer after viewing it and completed before it expires.
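A possible form of the de-duplication and selection is sketched below on made-up records. The column names and the exact conditions (viewed after receipt, completed after viewing, completed no later than the deadline) are assumptions about how this filter could be expressed, not the project's confirmed logic.

```python
import pandas as pd

nan = float("nan")

# Made-up merged offer records (times in hours); column names are assumptions
merged = pd.DataFrame({
    "user_id":        ["u1", "u1", "u2", "u3", "u4"],
    "offer_id":       ["a1", "a1", "b2", "a1", "b2"],
    "time_received":  [0, 0, 24, 0, 24],
    "time_viewed":    [12.0, 12.0, nan, 60.0, 30.0],
    "time_completed": [48, 48, 100, 40, 400],
    "offer_expiry":   [120, 120, 264, 120, 264],
})

# Remove repeats of the same user, offer and event times
deduped = merged.drop_duplicates(
    subset=["user_id", "offer_id", "time_received", "time_completed"]
)

# Completed after viewing and no later than the expiry deadline;
# comparisons with a missing view time evaluate to False
influenced = deduped[
    (deduped["time_viewed"] >= deduped["time_received"])
    & (deduped["time_completed"] >= deduped["time_viewed"])
    & (deduped["time_completed"] <= deduped["offer_expiry"])
]

share = round(len(influenced) / len(deduped) * 100, 2)
print(influenced["user_id"].tolist(), share)
```

In this toy data only `u1` qualifies: `u2` never viewed the offer, `u3` completed it before viewing, and `u4` completed it after the deadline.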

Let's compare this result with the original number of people who completed an offer, using a visualization, to answer the question below.

Question-3. What percentage of Customers completed an offer after viewing it and what percentage of Customers completed an offer without viewing it?

Based on the pie chart, it's evident that 71% of completed offers were completed after users viewed them.

Let's look at the effect of advertisement on the transactions that users have made to answer our next questions.

Question-4. How many transactions were completed/influenced because of the sent offers?

Finally, let's combine both dataframes, to see overall effect of an offer, whether it's BOGO, Discount or Advertisement on the behaviour of the user.

So there are 33081 transactions, i.e. 19.22% of the total transactions, that were influenced by the offers.

Now let's dive in to find out the facts that are influenced by the offers.

Question-5. Find out the facts that are influenced by the offer?

a. To start with, let's find out how gender affects the amount spent in a transaction and which type of discount attracts which gender.

Plot the amount spent by gender and the response to offer rewards by gender, for customers whose personal info is available.

From the bar chart, we observe that females spend 3.99 USD on average, which is more than the male average of 3.48 USD. The other category seems to spend the most, with an average of 4.35 USD. This is probably due to their low number compared to males and females.

The visualization on the right displays how each gender responds to offer rewards. Males seem to respond more to the 2.0 USD, 3.0 USD and 5.0 USD offer rewards than females, whereas females respond more to the 10.0 USD offer reward than males.

For the other category, the numbers are too small to draw a conclusion; however, the visualization suggests they respond slightly more to the 5.0 USD reward.

A similar representation of the gender behaviour towards offer response is plotted using a Heatmap.

b. Second, let's see the spread of the amount spent by users who completed the offer after viewing it and before it expired, and by users who didn't complete the offer.

From the left box plot, we observe that customers who didn't provide any personal info tend to spend more per transaction for completed offers. However, from the right box plot we observe that customers who provided personal info tend to spend a lot more.

Since there are many outliers in the amount spent by customers, the y axis is limited to display the trend where the spread is greatest.

c. Third, find the correlation between age groups and the amount spent by customers for transactions with completed offers and without offers.

From the above scatter plot with a fitted regression line, we observe a mostly positive correlation between the average amount spent and age group for both completed and incomplete offers, with the data densest between the late 50s and late 60s. Older customers tend to spend more per transaction in both cases.

Since there are many outliers in the amount spent by customers, the y axis is limited to display the trend where the spread is greatest.

d. Fourth, find the correlation between customer income and the amount spent by customers for transactions with completed offers and without offers.

We can see that the correlation between customer income and the amount spent is positive for transactions both with and without completed offers. The amount spent per transaction is higher when the user's income is higher, which is expected.

Since there are many outliers in the amount spent by customers, the y axis is limited to display the trend where the data is densest.

e. Fifth, let's see how age plays a role in responding to offer rewards.

Based on the above visualization, it's noticeable that age doesn't play a big role in responding to offer rewards, meaning customers of all age groups respond to the offers in a similar fashion.

Question-6. How did customers respond to offers across the different advertisement channels, for both completed and incomplete offers?

To start with, filter the data into two categories: first, channels used for completed offers, and second, channels used for offers that were not completed.

From the left donut chart, we can see that users responded to completed offers at almost the same percentage across all channels used.

The data suggests a similar trend in customer response for incomplete offers.

Implementation and Data Modeling

Question-7. Predict whether a Customer will respond to an offer or not using RandomForestClassifier and LinearSVC.

Now that we are done with the data exploration and analysis, let's build a model that predicts whether a user will respond to an offer. To achieve this, let's assign our target variable, the offer_success column, to y, and assign the features, i.e. the user demographics variables, to X.
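The assignment of X and y, followed by a train/test split, might look like the sketch below. Only `offer_success` is named in the text; the feature columns, the sample values, the 80/20 split and the use of stratification are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up modelling table; feature names are assumptions
data = pd.DataFrame({
    "age":           [25, 34, 47, 52, 61, 68, 39, 55, 72, 44],
    "income":        [35000, 48000, 62000, 70000, 81000, 90000, 51000, 66000, 95000, 58000],
    "gender_F":      [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
    "offer_reward":  [2, 5, 10, 3, 5, 10, 2, 3, 5, 10],
    "offer_success": [0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
})

# Target and features
y = data["offer_success"]
X = data.drop(columns=["offer_success"])

# Stratify so both splits keep a similar share of responders
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```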

Let's use the RandomForestClassifier and LinearSVC modelling techniques to build our models and predict the user response.

Scale features
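The text does not say which scaler is used, so the sketch below assumes scikit-learn's `StandardScaler` purely as an example. The important detail it illustrates is fitting the scaler on the training data only and reusing those statistics for the test data. The numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up age and income values
X_train = np.array([[25, 35000.0], [47, 62000.0], [61, 81000.0], [39, 51000.0]])
X_test = np.array([[52, 70000.0], [34, 48000.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics so no test information leaks into the model
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Scaling matters mostly for LinearSVC, which is sensitive to feature magnitudes; tree-based models such as random forests are largely unaffected by it.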

Train Our Classifier

Modeling using RandomForestClassifier
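A hedged sketch of training a random forest with a grid search is shown below. It runs on synthetic data from `make_classification` as a stand-in for the project's features, and the parameter grid is a deliberately small assumption, not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and the offer_success target
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Small, assumed parameter grid; scored with F1 to match the chosen metric
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, scoring="f1", cv=3
)
rf_search.fit(X_train, y_train)

# Best model refit on the full training split
rf_model = rf_search.best_estimator_
print(rf_search.best_params_)
```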

Feature importance by RandomForestClassifier
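Feature importances can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data, and the feature names are hypothetical labels attached only to make the output readable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names attached to synthetic data
feature_names = ["age", "income", "gender_F", "gender_M", "offer_reward"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=1, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure of influence.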

Predict test data using RandomForestClassifier

Data Modeling Using LinearSVC
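A comparable sketch for LinearSVC is shown below, tuning only the regularization parameter `C`. It again uses synthetic data; wrapping the scaler and classifier in a pipeline, the candidate values for `C` and `max_iter=10000` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the feature matrix and the offer_success target
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scaling inside the pipeline is refit on each cross-validation fold
pipeline = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000, random_state=42))

# Assumed candidate values for the single regularization parameter C
param_grid = {"linearsvc__C": [0.01, 0.1, 1, 10]}
svc_search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
svc_search.fit(X_train, y_train)

# Predictions on the held-out split using the best pipeline
svc_pred = svc_search.predict(X_test)
print(svc_search.best_params_)
```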

Predict test data using LinearSVC

Model Evaluation and Validation

We use the classification report from sklearn to evaluate the model.
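As an illustration of what the report contains, the sketch below evaluates a small set of made-up predictions. The class names passed as `target_names` are labels chosen for this example.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Made-up labels: 0 = no response, 1 = response
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Per-class precision, recall, F1 and support, plus overall accuracy
report = classification_report(
    y_test, y_pred, target_names=["no response", "response"]
)
print(report)

# The same quantities as plain numbers
accuracy = accuracy_score(y_test, y_pred)
f1_per_class = f1_score(y_test, y_pred, average=None)
```

Reading the F1 score per class is useful when the classes are imbalanced, since a high overall accuracy can hide weaker performance on the smaller class.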

Model Evaluation for RandomForestClassifier

Based on the model evaluation, the performance of our RandomForestClassifier model is as follows:

Model Evaluation for LinearSVC

Based on the model evaluation, the performance of our LinearSVC model is as follows:

Justification

Based on the analysis and exploration done through various visualizations, we identified how customer demographics and offer rewards affect user response to the offers/advertisements sent.

First, we identified users for whom demographic information was missing and classified them into a separate group. About 13% of all users have no demographic info in the dataset. This helped us identify the gender distribution in the dataset accurately. We saw that males make up 57% of users and females 41%, leaving 1% for others.

Then we looked at the income distribution of the users. The data suggests that many users have an annual income between 30000 USD and 50000 USD, and the majority of customers have an income between 50000 USD and 75000 USD.

Then we looked at the distribution of event types. 55% of the records are offer events and 45% are transactions.

After that, we identified the breakdown of the number of offers sent to customers. Almost the same number of Buy 1 Get 1 and Discount offers were sent to users, which is almost double the number of Advertisement offers sent.

Then we identified that, out of 76277 offers sent, 57725 were viewed by users, which is 75.68% of all offers sent. 71% of completed offers were completed after users viewed them, leaving 29% that were completed without being viewed.

After that, using several visualization techniques, we derived the factors that are influenced by the offers, such as:

a. how gender affects the amount spent in a transaction and which type of discount attracts which gender types.

b. spread of data for amount spent by users who completed the offer after viewing it and before it expires and for users who didn't complete the offer.

c. correlation between Age Groups and Amount spent by customers for the transactions with and without completed offers.

d. correlation between Customer Income and Amount Spent by customers for the transactions with and without completed offers.

e. how age plays a role in responding to offer rewards.

Then, we identified how customers responded to offers across the different advertisement channels, for both completed and incomplete offers.

Finally, we trained two supervised classification models that predict whether a user will respond to an offer based on demographics and offer reward data. Both models predicted user responses with an accuracy of 87% and an F1-score of 0.93 for users who won't respond to an offer. For users who will respond to offers, the F1-scores were 0.69 for RandomForestClassifier and 0.70 for LinearSVC.

Random Forest builds many trees on subsets of the data and combines the output of all the trees. This reduces the overfitting problem of single decision trees, reduces variance, and therefore improves accuracy, which is why this algorithm was chosen as one of the modelling techniques for this project.

Since the LinearSVC classifier is relatively fast and has just one parameter to tune, we selected it as our second modelling technique for predicting user response.

Since the F1-score and accuracy of the RandomForestClassifier and LinearSVC models are nearly the same when tuned with GridSearchCV, and since RandomForestClassifier takes much longer to train (around 1100 seconds) than LinearSVC, we go ahead with LinearSVC as our final model for predicting user responses to offer rewards.

Conclusion

Reflection

The problem I chose to solve in this project was to build a model that predicts whether a customer will respond to an offer. The approach used to solve this problem has three main steps.

The most interesting aspect of this project was combining different datasets and using predictive modeling and analysis to support better decisions and provide value to the business. The data exploration and wrangling steps were the longest and most challenging part. The toughest part of the entire analysis was finding the right logic and strategy to answer the problem statements and conveying the answers with different visualization techniques.

Improvement